DESIGN AND EVALUATION OF A FAULT-TOLERANT MULTIPROCESSOR
USING HARDWARE RECOVERY BLOCKS


Yann-Hang Lee and Kang G. Shin 


ABSTRACT 

In this paper we consider the design and the evaluation of a 
fault-tolerant multiprocessor with a rollback recovery mechanism. 

The rollback mechanism is based on the hardware recovery block
which is a hardware equivalent to the software recovery block. The
hardware recovery block is constructed by consecutive state-save operations
and several state-save units in every processor and memory module. When a
fault is detected, the multiprocessor reconfigures itself to replace the faulty
component and then the process originally assigned to the faulty component
retreats to one of the previously saved states in order to resume fault-free
execution. 


Due to random interactions among cooperating processes and also due 
to asynchrony in the state-savings, the rollback of a process may 
propagate to others and multiple-step rollbacks may thus become necessary.
In the worst case, when all the available saved states are exhausted, the
processes have to restart from the beginning as if they were executed in a
system without any rollback recovery mechanism. A mathematical model is
proposed to calculate both the coverage of multi-step rollback recovery and
the risk of restart. The performance evaluation in terms of the mean and
variance of execution time of a given task is also presented. 


Index Terms - Fault-tolerant multiprocessor, rollback recovery,
hardware/software recovery block, rollback propagation,
coverage of recovery.



1. INTRODUCTION 


There are numerous benefits to be gained from a multiprocessor. In
addition to the decreasing hardware cost and the inherent reliability of
LSI components, the capacity for reconfiguration makes the multiprocessor
attractive when system reliability is important. It is particularly essential to
critical real-time applications that the system be tolerant of failure with
minimum time overhead and that the task be completed prior to the imposed
deadline. Hence, one of the major issues of reliable multiprocessor design is
error recovery without having to restart the whole task when an error
occurs.

In general, the tolerance of failure during system operation is
realized by three steps: detection of error, reconfiguration of system
components, and recovery from error. The purpose of error detection is to
recognize the erroneous state and to prevent a consequent failure of the
system. There are two design approaches in error detection: (1) detect an
error immediately, and (2) isolate the erroneous information before it is
propagated. For the first approach, the most widely used techniques are
error detection/correction coding and the addition of built-in checking
circuits (e.g., voting hardware). Error detection schemes such as consistency
tests, the execution of validation routines, or acceptance tests are typical
methods for the second approach. After the detection of an error, the
faulty components, which are the source of error, are localized and
replaced so as to enable the system to be operational again. To recover
from an error, the rollback recovery method or the re-initialization of a
fault-free subsystem is usually invoked in order to resume the failed
computation. Both methods consist of state restoration and recovery point
establishment. In the JPL-STAR system [1], the recovery points are defined by
the application program which also takes the responsibility of compensating
for the information prior to the recovery point. Hence its error recovery
capability is constructed at the application software level. On the other
hand, the strategies used in PLURIBUS [2] are to organize the hardware
and software components into reliable subsystems and to mask the error
above the interface level of a subsystem. When an error is detected, the
subsystem performs backward recovery by restarting the subsystem.

The conventional restart recovery technique could be costly and inept
since (1) the computation between the start of task and the time when an
error is detected is lost, and (2) if the task is distributed over different
processing units in the multiprocessor, it is difficult to provide a consistent
task state and to isolate a subtask to prevent the propagation of erroneous
information to others (these may lead to the restarting of the whole task
and result in high re-initialization overhead). The rollback recovery method
at the software level is also difficult to implement and may not be effective,
especially for tightly coupled processes, since (1) the software recovery
points in each process are not sufficient to recover the task unless they
belong to the same recovery line [3], and (2) the program designers have
to structure carefully the parallel processes so that the interacting
processes establish recovery points in a well-coordinated manner. (This
could become a heavy burden on the program designers.) Several
alternatives have been proposed; for example, the conversation scheme [4],
the interprocess communication primitives in producer-consumer system [5],
the programmer-transparent scheme [6,7], the system defined checkpoints
[3], etc. These methods could lead to a loss of efficiency in the absence of
error, the accumulation of a large amount of recorded states for heavy
interprocess communications, or some undesirable restrictions in
communication schemes.

However, the concept of the recovery block, proposed by Randell
[3,4], can still be useful for tolerating hardware faults in the
multiprocessor. In this paper, we employ this concept to construct a
hardware recovery block which enables the task to survive processor or
memory failures. In general a process state can be regarded as the status
of internal registers of the assigned processor and the process variables
stored in memory. In order to resume a failed process, an error-free
process state should be restored. The hardware recovery block is
constructed in a quasi-synchronized manner which saves all states of a
process consecutively and automatically. This happens in parallel with the
execution of the process by using a special state-save mechanism
implemented in hardware. The hardware recovery block is different from the
software recovery block which only saves non-local states when a
checkpoint is encountered. Moreover, instead of the assertions in the
checkpoint of the software recovery block, the hardware resources are tested
by embedded checking circuits and self-test routines. After an error is
detected and the faulty component is located, the system will be
reconfigured to replace the failed hardware module. By loading the program
code and by transferring the recorded states into the replacement module,
the original process can be resumed.


The multiprocessor with a hardware recovery block scheme takes
advantage of the large number of processor units available to facilitate fast
recovery from hardware failures. Furthermore, the system minimizes the
time required to establish every recovery block, which would otherwise
significantly affect system performance.

For both hardware and software recovery blocks, the rollback of the
failed process to the previous state is not sufficient for concurrent
processing. The rollback of one process may propagate to other processes
or to a further recorded state. (This is called rollback propagation.) The
worst case is when an avalanche of rollback propagations, namely the
domino effect, occurs. The domino effect is impossible to avoid if no
limitation is placed on process interactions [8]. Instead of placing any such
limitations, several consecutive states are saved so that the processes are
allowed to roll back multiple steps in case of rollback propagation. The
coverage of a multi-step rollback, which indicates the probability of having
a successful rollback recovery when the processes roll back multiple steps,
should be examined to decide the effectiveness of this method. Both the
recovery overhead and the computation loss resulting from this automatic
rollback recovery mechanism should also be studied carefully. Furthermore,
since the time interval between two consecutive state savings is related to
the final performance figure of this method, the optimal value of this
interval has to be determined.


This paper is divided into five sections. Since the construction of
hardware recovery blocks in the multiprocessor plays a basic role, we
review it briefly in Section 2. The detailed description can be found in
[9,10]. In Section 2, we also extend the previous design to a general
multiprocessor on which our hardware fault recovery can be implemented.
Section 3 presents an algorithm to detect rollback propagations among
cooperating processes and also proposes a model to evaluate the coverage of
multi-step rollback recovery. Section 4 uses the results of Section 3 and
deals with the analysis and estimation of performance in terms of the mean
and variance of the task completion time. The conclusion follows in Section
5.


2. AUTOMATIC ROLLBACK MECHANISM FOR A MULTIPROCESSOR 

The multiprocessor under consideration has a general structure and 
consists of processor modules, interconnection network and/or common 
memory modules. To benefit from the locality of reference, every processor 
module owns its local memory which is accessible via a local bus. Every 
processor module can also access the shared memory through the 
interconnection network. First, the basic state-save mechanism associated
with every processor module and common memory is briefly presented. Then
we discuss the rollback recovery operations of a task for which the
following two multiprocessors can be used: in one, there is no common
memory, but local memory of one processor module is accessible by other
processor modules (e.g., Cm* system [11]); in the other, the system is
equipped with separated common memory modules [12] and restricts the
access of local memory only to the resident processor.


2.1 Processor Module, Common Memory, and State-save 

A basic processor module (PM) in the multiprocessor comprises a
processor, a local memory, a local switch, state-save memory units (SSUs)
and a monitor switch as shown in Fig. 1. It is assumed that a given task
is decomposed into processes each of which is then assigned to a processor
module. The shared variables among these cooperating processes are located
in the shared memory which is either separated common memory or local
memories depending upon the multiprocessor structure. Thus each process
in a PM can communicate with other processes (allocated to other PMs)
through the shared variables. PMs save their states (i.e., process local
variables and processor status) in an SSU at various stages of execution;
this operation is called a state-save. Ideally, it would be preferable to save
states of all processes at the same instant during the execution of task.
Because of the indivisibility and asynchrony of instruction execution in
PMs, it is difficult to achieve this ideal case without forced synchronization
and the consequent loss of efficiency. In order to alleviate this problem,
we employ a quasi-synchronized method in which an external clock sends all
PMs a state-save invocation signal at a regular interval, $T_{ss}$. This
invocation signal will stimulate every PM to save its states as soon as it
completes the current instruction and then to execute a validation test. If
the processor survives the test, the saved state would be regarded as the
recovery point for the next interval. If the processor fails the validation
test or an error is detected during execution of a process, the system will
be reconfigured to replace the faulty component and the associated process
will roll back to one of the previously saved states. The detailed operations
of state saving and rollback recovery are shown in Fig. 2.

Similarly to a processor module, each common memory module (CM)
also contains state-save memory units and a monitor switch. These SSUs are
used to record the updates of CM only. The access requests of CM are
managed by an access queue on the basis of first-come-first-serve
discipline. When a PM refers to a variable resident in a CM, an access
request is sent to the destination CM through the interconnection network
and enters the access queue associated with the CM. When all the
preceding requests to this CM are completed, the access request will be
honored and a reply will be sent back to the requesting PM. When a
state-save invocation is issued, a state-save request is placed at the tail of
every access queue. Thus the state-save in CM is performed when the
requests made prior to the state-save invocation have been completely
served.
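
The queue discipline described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the class
name CommonMemory, the request tuples and the STATE_SAVE marker are
hypothetical stand-ins for the hardware access queue. The point shown is
that a state-save request placed at the tail of the queue captures exactly
the writes that were queued before the invocation.

```python
from collections import deque

STATE_SAVE = "state-save"  # hypothetical marker for a queued state-save request


class CommonMemory:
    """Toy model of one CM with a first-come-first-serve access queue."""

    def __init__(self):
        self.cells = {}          # current contents of the module
        self.queue = deque()     # pending requests, oldest first
        self.saved_states = []   # one snapshot per completed state-save

    def request_write(self, address, value):
        self.queue.append(("write", address, value))

    def invoke_state_save(self):
        # The invocation only places a request at the tail of the queue.
        self.queue.append((STATE_SAVE, None, None))

    def serve_all(self):
        # Requests are honored strictly in arrival order.
        while self.queue:
            kind, address, value = self.queue.popleft()
            if kind == STATE_SAVE:
                self.saved_states.append(dict(self.cells))
            else:
                self.cells[address] = value


cm = CommonMemory()
cm.request_write("x", 1)      # issued before the invocation
cm.invoke_state_save()
cm.request_write("x", 2)      # issued after the invocation
cm.serve_all()
```

In this run the saved state holds only the first write, while the module
itself ends with the second one.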

During a state-save interval, besides the normal memory reference or
instruction execution, certain operations are automatically executed; for
example, a parity check is done whenever a bus/memory is used. Some
redundant error detection units also accompany the processor module [13],
e.g., dual-redundancy comparison, address-in-bound check, etc. These
units are expected to detect a malfunction whenever the corresponding
function units are used. An additional validation process which could be the
execution of self-test routine refreshes the shelters to guarantee that the
saved state be correct and thus guards against the existing fault extending
to the next state-save interval.

Suppose there are (N+1) state-save units for every PM (and every
CM), called $SSU_1$, $SSU_2$, ..., $SSU_{N+1}$. These units are used for
saving states at (N+1) consecutive state-save intervals. Thus each PM or CM
is able to keep N valid states saved in N SSUs and record the currently
changing state in the remaining SSU. As shown in Fig. 3, the units $SSU_1$,
$SSU_2$, ... are so arranged to record the states for consecutive state-save
intervals T(i), T(i+1), ..., T(i+N), and one unit is used to record the
updates in the current state-save interval, T(i+N+1). To minimize the time
overhead required for state-saving, the saving is done concurrently with
process execution. Every update of variables in the local memory is also
directed to the current SSU. When a PM or CM moves to the next
state-save interval, each used SSU will age one step and the oldest SSU
will be changed to the current position if all SSUs are exhausted. The
monitor switch is used to route the updates to SSUs and to manage the
aging of SSUs. Therefore the state-save mechanism of each PM or CM
provides an N-step rollback capability. However, in Section 3, we will show
that only a small number of SSUs are sufficient to establish high coverage
of rollback recovery for a given task.

Since the update of dynamic elements is recorded in only one SSU,
the other SSUs are ignorant of it. This fact may bring about a serious
problem: the newly updated variables may be lost. In order to avoid this,
it is necessary to make the contents of the currently updated SSU identical
with that of the memory or to copy the variables that have been changed
in the previous intervals into the current SSU. A solution to this problem
has been discussed in our previous paper [9]. At each state-switching
instant, the current SSU contains not only the currently updated variables
but also the previously updated variables. Consequently, the contents of
the current SSU always represent the newest state of the PM or CM.
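
The aging of state-save units can be illustrated with a small software
model. It is a sketch under stated assumptions, not the hardware design of
[9]: the class StateSaveUnits and its method names are hypothetical, states
are plain dictionaries, and the current unit is assumed to carry the
complete newest state so that a copy of it can serve as a recovery point.

```python
class StateSaveUnits:
    """Toy model of the (N+1) state-save units of one PM or CM."""

    def __init__(self, n_steps):
        if n_steps < 1:
            raise ValueError("at least one rollback step is required")
        self.n_steps = n_steps   # N: number of valid saved states kept
        self.saved = []          # saved states, oldest first (at most N)
        self.current = {}        # the unit recording the current interval

    def write(self, name, value):
        # Every update is routed to the current unit.
        self.current[name] = value

    def state_save(self):
        # The current unit holds the complete newest state, so a copy of
        # it becomes the recovery point; the oldest saved state is dropped
        # once all units are in use.
        self.saved.append(dict(self.current))
        if len(self.saved) > self.n_steps:
            self.saved.pop(0)

    def roll_back(self, steps):
        # An n-step rollback returns to the n-th most recent saved state.
        if not 1 <= steps <= len(self.saved):
            raise ValueError("saved states exhausted; the task must restart")
        del self.saved[len(self.saved) - steps + 1:]
        self.current = dict(self.saved[-1])
        return dict(self.current)


units = StateSaveUnits(n_steps=2)
for value in (1, 2, 3):
    units.write("x", value)
    units.state_save()
units.write("x", 4)                  # update in the current interval
one_step = units.roll_back(1)        # back to the start of this interval
two_steps = units.roll_back(2)       # one saved state further back
```

With N = 2 the state saved first (x = 1) has already been aged out, so a
further two-step rollback is refused, which corresponds to the restart case.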


2.2 Rollback Recovery Operations of a Task

As described in the above section, each processor module and common 
memory has its own rollback mechanism with several saved states. With 
these individual rollback recovery capabilities, the rollback recovery of a 
task is described as follows. 

Suppose a task is partitioned and then allocated to M modules, indexed
i = 1,2,...,M. These modules include PMs and CMs and will be dedicated to
this task until its completion. The state saving of a task implies the
state-savings of these modules. The rollback of a process is equivalent to
the state restoration of the associated modules. Since the process state
includes the internal hardware states, local variables and global variables,
the resumption of a failed process may need cooperation from common
memory and/or other processes. Moreover, due to arbitrary interactions
between cooperating processes and the asynchrony in state savings among
them, the rollback of one process may cause others to roll back and it is
therefore possible to require a multi-step rollback (a detail of this will be
discussed in the next section). In order to make decisions as to rollback
propagation and also to perform housekeeping jobs (e.g., task allocation,
interconnection network arbitration, reconfiguration, etc.), a system monitor
and a switch controller are included in the multiprocessor. The switch
controller handles the global variable references and records these
references for analyzing rollback propagation and multi-step rollback. The
system monitor receives the task execution command and then allocates PMs
and CMs to perform the task. Both devices are defined in a logical sense.
They could be a host computer, a special monitor processor, or one of the
general processor modules in the system.

To deal with the error recovery, the system monitor receives reports
from each module about the state-save operations and its conditions. Once
an error is detected, the system monitor will signal "retry" to the module
in question. If the error recurs, a permanent fault is declared and the
following steps are taken by the system monitor and the switch controller.

1. Stop all PMs that are executing processes of the task in question.

2. Make a decision as to rollback propagation.

3. Resume the execution of processes that are not affected by rollback
   propagation.

4. Find a free module to replace the failed one.

5. Transfer the process or data in the failed module to the
   replacement module and reroute the path to map addresses directed
   to the faulty module into its replacement.

6. Restore the previous states of the processes affected by the
   rollback of the process in the faulty module.

7. Any interaction directed to a module to be restored must wait for
   the resumption of the module. Old and unserviced interactions
   issued by the rolled-back PMs, which are still queued in the access
   queues, are cancelled.


3. ROLLBACK PROPAGATION AND MULTI-STEP ROLLBACK

In order to roll back a failed process, the consistent values of the
process variables and the internal states of the associated PM should be
provided. The local variables and internal states which are saved in the
SSUs of a PM are easily obtainable. However, the shared variables which
may be located in any arbitrary PM or CM and may be accessed by any
arbitrary processes bring about a difficult problem: the rollback of a failed
process induces the rollback of other processes, i.e., rollback propagation
occurs. The rollback propagation might result in another inconsistent state
for certain processes. Therefore, a multi-step rollback is required.

Furthermore, the hardware may have latent faults which are
undetectable until they induce some errors. In the following discussion, we
assume that an error will be detected immediately when it occurs. So the
rollback propagation is used only to obtain a consistent state. However, it
can be easily extended to the case in which error latency exists and is
bounded (by U) [14]:

(1). First obtain a consistent state which may require rollback
     propagations and calculate the total rollback distance, D.

(2). If $D \ge$ the total computation done then restart
     else if $D \ge U$ then done
     else go to step (1).
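
The two steps above amount to a simple loop, sketched here in Python. The
function name and the callback next_consistent_distance are hypothetical:
the callback stands for whatever procedure finds the next consistent state
farther back and reports its total rollback distance D. Time units are
arbitrary, and the callback is assumed to return a strictly larger distance
on each call.

```python
def recover_with_latency(next_consistent_distance, latency_bound, work_done):
    """Follow steps (1)-(2): roll back until the distance covers the latency.

    next_consistent_distance(d) is a hypothetical callback that returns the
    total rollback distance D of the next consistent state farther back
    than d.
    """
    distance = 0.0
    while True:
        distance = next_consistent_distance(distance)   # step (1)
        if distance >= work_done:                        # step (2)
            return "restart", distance
        if distance >= latency_bound:
            return "done", distance
        # otherwise repeat step (1) and roll back further


# Hypothetical setting: a consistent state exists every 10 time units.
def every_ten(distance):
    return distance + 10.0


outcome = recover_with_latency(every_ten, latency_bound=25.0, work_done=100.0)
early = recover_with_latency(every_ten, latency_bound=25.0, work_done=15.0)
```

In the first call the loop stops at the first consistent state whose
distance covers the latency bound; in the second the rollback distance
reaches the computation done so far and the task restarts.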


3.1 Rollback Propagation and Multi-step Rollback

In general, rollback propagation cannot be avoided if the processes
interact with each other arbitrarily. For the organization of multiprocessor
in the previous section, a process will be allocated to one PM and/or several
CMs and each module has its own rollback recovery mechanism. So each
module can be regarded as an object for rollback propagation. Each
interaction between cooperating processes is implemented as a memory
reference to a shared variable. It is also regarded as a memory reference
across the modules. To avoid having to trace every reference to the shared
variables and to simplify the detection of rollback propagation, we assume
that the failure of a particular module leads to the automatic rollback of all
modules that have interacted with it during the current state-save interval.
Let $P_i \to P_j$ denote the rollback propagation in which the rollback of
process $P_i$ induces the state restoration in more than one module and then
induces the rollback of process $P_j$. An example is presented in Figure 4,
where process $P_1$ fails at time $t_f$. Since the interactions between $P_1$
and $P_2$ exist during the time interval $(t_k, t_f)$, process $P_2$ must roll
back to enable the interaction for the resumption of $P_1$. The rollback of
$P_2$ will propagate further to other processes in this example.

In the above example, we can find that the rollback of $P_3$ and $P_2$ to
their most recently saved state still cannot provide a consistent state.
(This requires a multi-step rollback.) The reason that a single step
rollback cannot recover the process states is mainly due to the occurrence
of references between the asynchronous state savings of interacting
processes. Consider the cases in Figure 5. Suppose $P_i$ rolls back because
of failure or rollback propagation from another process. In case (a), the
single step rollback of $P_i$ is sufficient to recover its state if there is no
other rollback propagated to $P_i$. In cases (b), (c), and (d), both $P_i$ and
$P_j$ have to roll back. Since there exists an interaction between the
state-savings of $P_i$ and $P_j$, rollback to a further state is necessary. A
property related to the necessary condition for a successful rollback can be
stated as follows:

Property: When process $P_i$ rolls back to the beginning of state-save
interval $T_i(m)$ (process $P_i$ may roll back n steps to reach this point,
$n \le N$), if there is no interaction with $P_j$ across different state-save
intervals $T_i(m)$ and $T_j(m-1)$ for all j, where $1 \le j \le M$ and
$j \ne i$, then the state of $P_i$ can be restored by this rollback.

This property implies that the rollback of a task TK, where
$TK = \{P_i \mid i = 1,2,\ldots,M\}$, will recover the task from a failure if
$P_i$ for any i is not affected by the rollbacks of $P_j$ for all $j \ne i$
and if $P_i$ rolls back $n \le N$ steps at which $P_i$'s state is restored.


3.2 The Detection of Rollback Propagation 

Since every external memory reference is managed by the switch
controller, the switch controller should take responsibility for detecting
rollback propagation and deciding on multi-step rollbacks. Suppose there
are (N+1) SSUs at each module, then the maximum possible rollback step is
N. Let the current state-save interval of module i be $T_i(k)$, then an
n-step rollback will restore the module i to the beginning of interval
$T_i(k-n+1)$. For state-save interval n (n=1,2,3,...,N), we assign two
matrices $KC_n$ ($M \times M$) and $KP_n$ ($M \times M$) to represent the
interaction during the state-save interval $T_i(k-n+1)$. Every element in
both matrices consists of a single bit. $KC_n(i,j)$ is set to 1 if an
interaction occurs between module i and module j during the state-save
intervals $T_i(k-n+1)$ and $T_j(k-n+1)$. If an interaction exists between
the two during module j's previous state-save interval, $T_j(k-n)$, then
$KP_n(i,j)=1$. The steps for setting these elements and checking the
rollback propagation are listed as follows:

1. Reset both matrices to zero at the beginning of the task.

2. When an interaction is issued from module i and directed to module
   j, then $KC_1(i,j)$ and $KC_1(j,i)$ are set to 1.

3. If module i saves its state and moves to the next state-save
   interval, then for j=1,2,...,M

   (a). $KP_1(j,i) = KP_1(j,i) + KC_1(i,j)$ (where + is logical OR operation),
        $KC_1(j,i) = 0$

   (b). $KC_n(i,j) = KC_{n-1}(i,j)$,
        $KP_n(i,j) = KP_{n-1}(i,j)$ for n=N,N-1,...,2

   (c). $KC_1(i,j) = 0$, $KP_1(i,j) = 0$

4. When module i rolls back n steps, the switch controller checks the
   corresponding two rows in matrices $KC_n$ and $KP_n$, namely $KC_n(i,j)$
   and $KP_n(i,j)$ for j=1,2,...,M. There are three possible conditions:
   1). if $KP_n(i,j)=1$, then module j has to roll back (n+1) steps; 2). if
   $KP_n(i,j)=0$ and $KC_n(i,j)=1$, then module j has also to roll back n
   steps; 3). if $KP_n(i,j)=0$ and $KC_n(i,j)=0$, then there is no direct
   rollback propagation from module i to module j.
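
The bookkeeping in steps 1 to 4 can be sketched as follows. This is one
reading of the steps, not the switch controller itself: the class
InteractionLog and its method names are hypothetical, modules are numbered
from 0, and the matrices are stored as nested lists indexed [n][i][j] with
index 0 unused so that n matches the subscript of KC and KP.

```python
class InteractionLog:
    """Bit matrices KC_n and KP_n (n = 1..N) kept by the switch controller."""

    def __init__(self, modules, steps):
        self.m, self.n = modules, steps

        def zero():
            # Index 0 is unused so that kc[n] corresponds to KC_n.
            return [[[0] * modules for _ in range(modules)]
                    for _ in range(steps + 1)]

        self.kc, self.kp = zero(), zero()            # step 1: all bits reset

    def interact(self, i, j):
        # Step 2: an interaction between i and j in their current intervals.
        self.kc[1][i][j] = 1
        self.kc[1][j][i] = 1

    def state_save(self, i):
        # Step 3(a): seen from module j, the interaction now lies in the
        # previous interval of module i.
        for j in range(self.m):
            self.kp[1][j][i] |= self.kc[1][i][j]
            self.kc[1][j][i] = 0
        # Step 3(b): age the rows of module i by one interval.
        for n in range(self.n, 1, -1):
            self.kc[n][i] = list(self.kc[n - 1][i])
            self.kp[n][i] = list(self.kp[n - 1][i])
        # Step 3(c): start the new interval of module i with clear rows.
        self.kc[1][i] = [0] * self.m
        self.kp[1][i] = [0] * self.m

    def direct_propagation(self, i, n):
        # Step 4: modules forced to roll back when module i rolls back n steps.
        forced = {}
        for j in range(self.m):
            if j == i:
                continue
            if self.kp[n][i][j]:
                forced[j] = n + 1
            elif self.kc[n][i][j]:
                forced[j] = n
        return forced


log = InteractionLog(modules=3, steps=2)
log.interact(0, 1)
before = log.direct_propagation(0, 1)   # both still in the same interval
log.state_save(0)                       # module 0 moves on, module 1 does not
after = log.direct_propagation(1, 1)    # module 1 rolls back one step
```

In this run a one-step rollback of module 1 after the state-save of module 0
forces module 0 back two steps, because the interaction took place before
the most recent state saving of module 0. A forced step count larger than N
would mean that the saved states are exhausted.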

Let us define $RB_i(n)$, n=1,2,...,N, to indicate the rollback step of
module i. If module i rolls back n steps, then $RB_i(n)=1$, otherwise
$RB_i(n)=0$. So, if $RB_i(n)=0$ for all n, then module i does not have to roll
back. From the above conclusions and definitions, the condition of having a
successful rollback recovery for a task can be expressed as follows:

The rollback of a task will be successful if one of the following two
conditions is satisfied for all modules i:

1. $RB_i(n)=0$, for all n.

2. If there is an integer n such that $RB_i(n)=1$, then either $KP_n(i,j)=0$
   for all j=1,2,...,M, or there exist integers j and w such that
   $KP_n(i,j)=1$, $RB_j(w)=1$, and $w > n$.

An example is shown in Figure 4, where Figure 4(a) describes
memory references, Figure 4(b) is the current contents of KC and KP
matrices, and Figure 4(c) is the result of rollback propagation.


3.3 The Evaluation of Multi-step Rollback

If a module i fails at time $t_f$ during the k-th state-save interval,
$T_i(k)$, then a single step rollback of module i is examined to see if it is
sufficient to recover from the failure. The result may lead to rollback
propagations and thus to multi-step rollbacks as previously discussed. Since
the number of state-save units associated with each module is finite, the
whole task may have to restart when all SSUs are exhausted. In this
section a probability model is derived to evaluate the coverage of the
multi-step rollback recovery which indicates the effectiveness of the present
fault-tolerant mechanism. Suppose every module has (N+1) SSUs and the
task is allocated to M modules including PMs and CMs. To derive the
coverage, the following assumptions are made and notations used:

A: The access matrix whose element $a_{ij}$ represents the probability of
making a reference from module i to module j. The sum of all
elements in one row must be equal to 1 for a processor module
i, i.e., $\sum_{j=1}^{M} a_{ij} = 1$.

$b_{ijn}$: The probability that $KP_n(i,j)=0$, which means no interaction
occurs between module i's and module j's (k-n+1)-th state saving
instants. For simplicity $b_{ijn}$ is assumed to be a constant for all
n, i.e., $b_{ij1} = b_{ij2} = \cdots = b_{ijN} = b_{ij}$. The exact value of
$b_{ij}$ is difficult to solve. An approximate representation is used, i.e.,
b..=Prob((Bjj n B.|) U (B|j f) B-)), where Bjj is the event that a 
arbitrary moment.


$f_{ijn}$: The average probability of having direct rollback propagation
from module i to module j due to an n-step rollback of module i.
We also assume $f_{ijn}$ to be a constant, $f_{ij}$, for all n.

$r_{ij}$: The probability that module j has to roll back because of the
direct or indirect propagation if module i rolls back. Note $r_{ii} = 1$
for all i.

E: The matrix $[e_{ij}]$, i,j=1,2,...,M, in which element $e_{ij}$ is the
average execution time for memory references issued from module
i to module j.

$T_E$: The total execution time of a given task under an error-free
condition and without the time overhead for generating recovery
blocks.

$T_i(k)$: The duration of the k-th state-save interval of module i. Because
of the asynchrony between state-save invocation and actual state
saving, $T_i(k)$ is a random variable. If $T_{ss}$ is long enough such
that there is always a state saving following every state-save
invocation, the mean of $T_i(k)$ is equal to $T_{ss}$. To make the
analysis simple, this duration is assumed to be constant and
equal to the duration of state-save invocation interval, $T_{ss}$.

$T_{sv}$: The time overhead for generating a recovery block.

$N_t$: The total number of state savings before task completion.
$N_t = \lfloor T_E / (T_{ss} - T_{sv}) \rfloor$.

$u_{ijk}$: The average memory reference rate from module i to module j
during the k-th state-save interval of module i. Occurrence of
these memory references is assumed to be a Poisson process with
a time-varying parameter during the progress of task execution.
In general, the memory references of processes can be divided
into different phases which have a constant reference rate
[15,16]. If $N_t$ is moderately large, $u_{ijk}$ could be assumed to be
a constant during a state-save interval.


To derive the coverage of a multi-step rollback, the probability of
direct rollback propagation, i.e., $f_{ij}$, should be obtained first. From the
above definitions and assumptions, $f_{ij}$ is the average probability that there
exists at least one memory reference between module i and module j during
one state-save interval. It can be expressed as follows:

$$ f_{ij} = g_{ij} + g_{ji} - g_{ij}\, g_{ji} \qquad (1) $$

where

$$ g_{ij} = \frac{1}{N_t} \sum_{k=1}^{N_t} \left( 1 - \exp(-u_{ijk}\, T_{ss}) \right) $$

represents the average probability of having an interaction issued by
module i and directed to module j during a single state-save interval.
Since the total number of memory references from module i to module j is
equal to $N_t T_{ss}\, a_{ij} / \sum_{m=1}^{M} a_{im} e_{im} = \sum_{k=1}^{N_t} u_{ijk} T_{ss}$,
we have the following relationship:

$$ \sum_{k=1}^{N_t} u_{ijk} = \frac{N_t\, a_{ij}}{\sum_{m=1}^{M} a_{im}\, e_{im}} \qquad (2) $$

Also the maximum memory reference rate $u_{ijk}$ must be less than or
equal to the reciprocal of $e_{ij}$, that is,

$$ 1/e_{ij} \ge u_{ijk} \ge 0 \qquad (3) $$

With the above two constraints we can get the extrema of $f_{ij}$ as follows:

1. The maximum value of $f_{ij}$, denoted as $f_{ij}'$, occurs when
   $u_{ij1} = u_{ij2} = \cdots = u_{ijN_t}$.

2. The minimum value of $f_{ij}$, denoted as $f_{ij}''$, occurs when there are h
   intervals (where $h = \lfloor N_t\, a_{ij}\, e_{ij} / \sum_{m=1}^{M} a_{im} e_{im} \rfloor$)
   in which $u_{ijk} = 1/e_{ij}$, $(N_t - h - 1)$ intervals in which
   $u_{ijk} = 0$, and one interval in which
   $u_{ijk} = N_t\, a_{ij} / \sum_{m=1}^{M} a_{im} e_{im} - h / e_{ij}$.
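
A small numerical sketch may help to see why evenly spread references give
the largest probability of direct rollback propagation and concentrated
references the smallest. It assumes the reading used here, namely that the
per-interval interaction probability is 1 - exp(-u * T_ss) averaged over
the intervals and that the two directions combine as a probability union;
the function names and all numbers are hypothetical.

```python
import math


def interaction_prob(rates, t_ss):
    """g_ij: mean over intervals of P(at least one reference in an interval)."""
    return sum(1.0 - math.exp(-u * t_ss) for u in rates) / len(rates)


def direct_propagation_prob(g_ij, g_ji):
    """f_ij: probability of a reference in at least one direction."""
    return g_ij + g_ji - g_ij * g_ji


T_SS = 10.0                       # hypothetical state-save interval
uniform = [0.1, 0.1, 0.1, 0.1]    # total rate 0.4, spread evenly
bursty = [0.4, 0.0, 0.0, 0.0]     # same total rate, concentrated in one interval

g_uniform = interaction_prob(uniform, T_SS)
g_bursty = interaction_prob(bursty, T_SS)
f_uniform = direct_propagation_prob(g_uniform, g_uniform)
f_bursty = direct_propagation_prob(g_bursty, g_bursty)
```

Because 1 - exp(-u * T_ss) is concave in u, the even split yields the larger
average for a fixed sum of rates, so f_uniform exceeds f_bursty in this run.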


To solve for $r_{ij}$ from $f_{ij}$, a fully connected network is drawn as
Figure 6 in which every node represents a module, and the link (i,j)
connecting node i and node j denotes the relationship for direct rollback
propagation between module i and module j. Then $f_{ij}$ can be considered as
the probability of having a directly connected link between node i and node
j. The theory of network reliability [17] can be used to solve for $r_{ij}$:

$$ r_{ij} = \bigcup_{q} D_{ij,q} \qquad (4) $$

where $D_{ij,q}$ is the probability that the q-th path from node i to node j is
connected and '$\cup$' is the probability union operation. With an additional
assumption that the occurrence of failure is equally distributed over each
module in a statistical sense, the coverage of a single step rollback,
denoted by C(1), becomes

MM M 

CO) = O/M) E n Eb||,)) (5) 

i=lj=l ‘ k=l ‘ 
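The network view behind Equation (4) can be illustrated by brute force for a small number of modules. The sketch below assumes that the links are present independently with probability $f_{ij}$ (an assumption of this example, not a statement from the paper) and enumerates every link state; `connection_probability` is a hypothetical helper name.

```python
from itertools import combinations, product

def connection_probability(f, src, dst):
    """Probability that src and dst are joined by at least one path when each
    undirected link (i, j), i < j, is present independently with probability
    f[i][j]. Exhaustive enumeration, so only suitable for a few modules."""
    m = len(f)
    links = list(combinations(range(m), 2))
    total = 0.0
    for state in product((False, True), repeat=len(links)):
        prob = 1.0
        adj = {i: [] for i in range(m)}
        for (i, j), up in zip(links, state):
            prob *= f[i][j] if up else 1.0 - f[i][j]
            if up:
                adj[i].append(j)
                adj[j].append(i)
        if prob == 0.0:
            continue
        # Depth-first search from src over the links that are present.
        seen, stack = {src}, [src]
        while stack:
            node = stack.pop()
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if dst in seen:
            total += prob
    return total

if __name__ == "__main__":
    f3 = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
    print(connection_probability(f3, 0, 1))
```

For three modules with every link probability equal to 0.5, nodes 0 and 1 are connected either directly or through node 2, giving 0.5 + 0.5 * 0.25 = 0.625.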

And the accumulated coverage from a single step rollback to an h-step
rollback can be derived by the following recursive equation:

    $C(h) = C(1)(1 - C(h-1)) + C(h-1)$    (6)
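The recursion in Equation (6) is easy to evaluate directly. The sketch below is illustrative (the function name is not from the paper); unrolling the recursion also shows that it is equivalent to the closed form $C(h) = 1 - (1 - C(1))^h$.

```python
def accumulated_coverage(c1, steps):
    """Return [C(1), ..., C(steps)] using C(h) = C(1)(1 - C(h-1)) + C(h-1)."""
    coverage = [c1]
    for _ in range(steps - 1):
        prev = coverage[-1]
        coverage.append(c1 * (1.0 - prev) + prev)
    return coverage

if __name__ == "__main__":
    for h, c in enumerate(accumulated_coverage(0.75067, 5), start=1):
        print(h, round(c, 5))
```

Starting from C(1) = 0.75067, the recursion agrees with the first column of Table 1 up to rounding of the last digit.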
The coverage of the multi-step rollback recovery is calculated for an
example with the following access matrix:

    0.9     0.08    0.02    0.
    0.1     0.85    0.03    0.02
    0.03    0.03    0.9     0.04
    0.      0.02    0.08    0.9


This example has the access localities 0.85 and 0.9 for processes, which
correspond to the experimental results obtained from Cm* [18]. The
numerical results are presented in Table 1 and are also plotted in Figure
7. These results include three cases: the best coverage computed from
$f_{ij}''$ for different values of $N_I$, and the worst coverage computed from $f_{ij}'$.
These results show that only a small number of SSUs is enough to achieve
a satisfactory coverage of rollback recovery. It should be particularly noted
that the requirement of a small number of SSUs is mandatory for actual
implementation.


4. THE PERFORMANCE OF ROLLBACK RECOVERY MECHANISM

Several methods for analyzing the rollback recovery system have been
proposed [19 - 22]. They in general deal with a transaction-oriented
database system and compute the optimum value of the intercheckpoint
interval. Castillo and Siewiorek studied the expected execution time which is
required to complete a task with the restart recovery method [23]. All of
these approaches either assume the state restoration is obtainable by a
single checkpoint or do not include the rollback capability at all. In this
section, we explicitly take into account the problem of multi-step rollback
and the risk of restart for the rollback recovery mechanism.


4.1 Notations and Assumptions

The following notations will be used in the sequel:

$T_t$:  The total execution time to complete the given task with
        occurrence of errors. It includes the required execution
        time under error-free condition, the time lost due to
        rollbacks and restarts, and the time overhead for
        generating recovery blocks.

$T_{real}$:  The total execution time to complete the task without
        restart (i.e., all failures are recovered by rollbacks).

$t^j_{roll,m}$:  The time lost due to the j-th rollback in module m,
        which consists of the set up time for resumption, $t_{sb}$,
        and the computation undone by rollback.

$t^i_{rst}$:  The time lost due to the i-th restart, which includes the
        set up time for restart, $t_{su}$, and the time between the
        previous start and the moment at which error is
        detected.

$TE_k$:  The accumulated effective computation before the k-th
        rollback when the task can be completed without
        restart.

$x^j_b$, ($x^i_s$):  The duration between the (j-1)-th and the j-th
        rollbacks (the (i-1)-th and the i-th restarts).

C(i):  The accumulated coverage of rollback recovery from a
        single step to i steps. This value is calculated by
        Equations (5) and (6) presented in the previous
        section.

$P_b$, ($P_s$):  The probability of rollback (restart) when a failure
        occurs.

$P_{st}(h)$:  The probability of having an h-step rollback given the
        failure is recovered by the rollback.

Pr(m):  The probability of having m rollbacks during the time
        interval $T_{real}$.

$Z_r(z)$, $Z_{st}(z)$:  The probability generating functions of Pr(m), $P_{st}(h)$,
        respectively.

$\Phi_t(s)$, $\Phi_{real}(s)$:  The characteristic functions of $T_t$, $T_{real}$, respectively.

The goal of our analysis is to calculate the mean and variance of the
total execution time of a given task. Suppose the task is decomposed and
then allocated to M modules. During the normal operation, a small
overhead is required to generate consecutive recovery blocks in each
module. When the j-th error occurs, module m spends $t^j_{roll,m}$ to recover
from this error if the error is recoverable by a rollback. Otherwise, the
whole task has to restart. $t^j_{roll,m}$ consists of the set up time, which is
composed of the decision delay required for examining rollback propagation
and the reconfiguration time, and the time used to make up for the computation
undone by the rollback. We assume that the task completion time is
postponed by $\max_m \{t^j_{roll,m}\}$, where m = 1, 2, ..., M, for the rollback recovery of
the j-th error. The resultant completion time will be an upper bound
for the following reasons: (1) $t^j_{roll,m}$ can be interpreted as the
time lost due to the rollback in module m, so the total time lost in all the
concerned modules is $\sum_{m=1}^{M} t^j_{roll,m}$. Since the completion of the task is regarded
as the completion of all its processes, the time lost from the task's point
of view could be $\max_m \{t^j_{roll,m}\}$ but not larger than this maximal value. (2)
The true delay imposed on the completion of the task by a rollback will be
shortened because of the possible reduction in the waiting time of process
synchronization. To facilitate system reconfiguration, we also assume the
multiprocessor has a sufficient number of modules so that the task may be
executed continuously from start to end without waiting for the availability
of modules. The time needed for error-free execution is regarded as
constant and is independent of reconfiguration.


In general, the occurrence of error can be modeled as a Poisson
process with parameter $\lambda(t)$, which equals the reciprocal of mean time
between failures [24]. Since $\lambda(t)$ is slowly time-varying (for example with a
period of one day), it is assumed to be constant over the duration of one
task execution, i.e., $\lambda(t) = \lambda$. For simplicity an error is assumed to be
detected immediately whenever it occurs (see Section 3 for a brief
description on relaxing this assumption). From the definitions of $P_b$, $P_s$,
and $P_{st}(h)$, we have $P_s = 1 - C(N')$, where N' is the number of states saved
and $N' \le N$, and each module has (N+1) SSUs. Therefore the probability of
rollback, $P_b$, becomes C(N). $P_{st}(h)$ is equal to $(1/P_b)(C(h) - C(h-1))$ for
h = 2, ..., N, and $P_{st}(1) = C(1)/P_b$. The occurrence of rollback and restart can
be modelled as Poisson processes with rates $\lambda_b = \lambda P_b$ and $\lambda_s = \lambda P_s$,
respectively.
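The quantities defined in this paragraph follow mechanically from the accumulated coverage. The sketch below assumes that all N saved states are available (N' = N); the function name and the dictionary keys are illustrative and not part of the paper.

```python
def rollback_parameters(coverage, lam):
    """Derive rollback/restart parameters from accumulated coverage.

    coverage[h-1] holds C(h) for h = 1..N (assumes N' = N saved states);
    lam is the error rate of the underlying Poisson process."""
    p_b = coverage[-1]                 # probability of rollback, C(N)
    p_s = 1.0 - p_b                    # probability of restart
    p_st = [coverage[0] / p_b] + [
        (coverage[h] - coverage[h - 1]) / p_b
        for h in range(1, len(coverage))
    ]                                  # P_st(h), h = 1..N
    return {
        "P_b": p_b,
        "P_s": p_s,
        "P_st": p_st,
        "lambda_b": lam * p_b,         # rate of the thinned rollback process
        "lambda_s": lam * p_s,         # rate of the thinned restart process
    }

if __name__ == "__main__":
    worst_case = [0.44713, 0.69433, 0.83100, 0.90656, 0.94834]
    print(rollback_parameters(worst_case, lam=0.001))
```

The example input is the worst-case column of Table 1; the step probabilities sum to one by construction, and the two thinned rates add up to the original error rate.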


4.2 The Performance Model

The total task execution time, $T_t$, can be divided into several phases
as shown in Figure 8. The last phase always ends with the completion
of the task. Other phases are followed by a restart, so the amount of effective
computation at the beginning of each phase is zero. During each phase,
rollback recoveries are allowed so that the effective computation between
rollbacks is accumulated toward the task completion. To derive the
distribution of $T_t$, we should determine the distribution of the duration of
the last phase (which is defined as $T_{real}$), the probability of having R
restarts, and the distribution of the durations of the other phases, which are
defined as $t^i_{rst}$ for i = 1, 2, ..., R.

In the last phase, the task will be executed from the beginning to
the completion without any restart. Let $T_{sv}$ denote the time overhead for
generating a recovery block. The effective computation in a state-save
interval under the error-free condition is $T_{ss} - T_{sv}$. It is assumed that $T_{ef}$
is much larger than $T_{ss}$, so that the rollback distance of an
h-step rollback can be approximated by $h T_{ss}$. The effective computation
between two consecutive rollbacks becomes $(X - h T_{ss})^+$ when a module rolls
back h steps, where $(X)^+ = \max\{0, X\}$ is the positive rectification function.
With the probability of having an h-step rollback, $P_{st}(h)$, two functions are
presented:

    $Z = \sum_{h=1}^{N} \exp(-h \lambda_b T_{ss}) P_{st}(h)$    (7)

    $H(t,k) = \sum_{i=0}^{k} \binom{k}{i} (1-Z)^i Z^{k-i} G_{k-i}(t)$    (8)


where $G_{k-i}(t)$ is the (k-i)-th gamma distribution function with parameter $\lambda_b$
for (k-i) > 0, and $G_0(t) = 1$. In Appendix A, we show that the distribution
function of the accumulated effective computation after k rollbacks is
$Prob(TE_k \le t) = H(t,k)$. Therefore the probability of having k rollbacks during
the time interval $T_{real}$, Pr(k), is given by

    $Pr(k) = P(TE_{k+1} > T_{ef}) - P(TE_k > T_{ef}) = H(T_{ef},k) - H(T_{ef},k+1)$    (9)
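The rollback-count distribution can be evaluated numerically. The following is a minimal sketch, assuming the binomial-mixture reading of H(t,k) (a sum over i of C(k,i)(1-Z)^i Z^(k-i) G_{k-i}(t), with G_n the Erlang distribution function of order n and rate $\lambda_b$); the function names and the sample parameters are illustrative only.

```python
import math

def erlang_cdf(n, rate, t):
    """G_n(t): distribution function of the sum of n exponentials; G_0 = 1."""
    if n == 0:
        return 1.0
    tail = sum((rate * t) ** j / math.factorial(j) for j in range(n))
    return 1.0 - math.exp(-rate * t) * tail

def z_value(p_st, rate_b, t_ss):
    """Z = sum over h of exp(-h * rate_b * t_ss) * P_st(h), with h starting at 1."""
    return sum(p * math.exp(-h * rate_b * t_ss)
               for h, p in enumerate(p_st, start=1))

def H(t, k, z, rate_b):
    """Assumed form of Prob(TE_k <= t) as a binomial mixture of Erlang CDFs."""
    return sum(math.comb(k, i) * (1.0 - z) ** i * z ** (k - i)
               * erlang_cdf(k - i, rate_b, t) for i in range(k + 1))

def prob_rollbacks(k, t_ef, z, rate_b):
    """Pr(k) = H(T_ef, k) - H(T_ef, k + 1)."""
    return H(t_ef, k, z, rate_b) - H(t_ef, k + 1, z, rate_b)

if __name__ == "__main__":
    z = z_value([1.0], rate_b=1.0, t_ss=0.1)
    print([round(prob_rollbacks(k, 2.0, z, 1.0), 4) for k in range(6)])
```

Since the terms telescope, the probabilities Pr(0), Pr(1), ... sum to 1 - H(T_ef, K+1), which tends to one as K grows; this gives a quick sanity check on any implementation.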

$T_{real}$ is composed of $T_{ef}$ and the time lost due to rollbacks, which is a sum
of identically distributed random variables, $t_{roll,m}$.
Substituting the probability mass functions Pr(k) and $P_{st}(h)$, we get the
characteristic function of $T_{real}$, which is given below:

    $\Phi_{real}(s) = \exp(-sT_{ef}) Z_r(\exp(-s t_{sb}) Z_{st}(\exp(-sT_{ss})))$    (10)

From Figure 8, the total time $T_t$ can be represented as the sum of $T_{real}$
and the random sum of $t_{rst}$. The characteristic function of $T_t$ derived
in Appendix B is given in the following:

    $\Phi_t(s) = \sum_{i=0}^{\infty} \exp(-s i t_{su}) (\lambda_s/(\lambda_s+s))^i \sum_{j=0}^{i} \binom{i}{j} (-1)^j \Phi_{real}((j+1)(\lambda_s+s))$    (11)


This equation shows a general expression of the total execution time. For
the system without rollback recovery mechanism, we can substitute $P_s = 1$,
$P_b = 0$, and then $\Phi_{real}(s)$ becomes $\exp(-sT_{ef})$. The result obtained from the
above equation is the same as that in [23]. The mean and variance of the
total execution time can be obtained from
$-\,d\Phi_t(s)/ds\,|_{s=0}$ and $d^2\Phi_t(s)/ds^2\,|_{s=0} - (d\Phi_t(s)/ds\,|_{s=0})^2$,
respectively.

In Figure 9, the mean execution time for the example in Section 3 is
plotted. It is obvious that the overhead of generating recovery block has
an important effect on the rollback recovery method. Since the state
savings are performed in parallel with the normal process execution, the
overhead contains only the time required for the validation test. Since the
embedded checking circuits are not cost-effective and complex [25], the
overhead of generating recovery block can be reduced with a completely
self-checking mechanism. Figure 10 expresses the variance of execution time
for the previous example. It suggests that the prediction of the total
execution time could be more accurate if the rollback recovery mechanism is
used. This result is expected intuitively since the probability of restart is
reduced considerably. In a system with a higher probability of restart, a
larger and more uncertain recovery overhead is involved.
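The qualitative claim about mean and variance can be checked with a small Monte Carlo experiment. This is a simplified sketch rather than the analytical model itself: it ignores the overhead of generating recovery blocks, treats the task as a single stream of computation, and uses illustrative parameter values; the function names are hypothetical.

```python
import math
import random

def simulate_completion_time(t_ef, lam, p_b, p_st, t_ss, t_sb, t_su, rng):
    """Draw one sample of the total execution time under the simplified model."""
    elapsed = 0.0    # wall-clock time so far
    progress = 0.0   # effective computation accumulated in the current phase
    while True:
        x = rng.expovariate(lam)              # time until the next error
        if progress + x >= t_ef:              # task completes before the error
            return elapsed + (t_ef - progress)
        elapsed += x
        if rng.random() < p_b:                # error recovered by a rollback
            h = rng.choices(range(1, len(p_st) + 1), weights=p_st)[0]
            progress += max(x - h * t_ss, 0.0)
            elapsed += t_sb
        else:                                 # saved states exhausted: restart
            progress = 0.0
            elapsed += t_su

def mean_and_variance(samples):
    """Sample mean and unbiased sample variance."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / (n - 1)
    return mean, var

if __name__ == "__main__":
    rng = random.Random(1)
    restart_only = [simulate_completion_time(1.0, 1.0, 0.0, [], 0.05, 0.0, 0.0, rng)
                    for _ in range(20000)]
    with_rollback = [simulate_completion_time(1.0, 1.0, 0.95, [1.0], 0.05, 0.0, 0.0, rng)
                     for _ in range(20000)]
    print(mean_and_variance(restart_only), mean_and_variance(with_rollback))
```

With restart only and zero set up time, the sample mean should approach the known value $(e^{\lambda T_{ef}} - 1)/\lambda$ for restart recovery under Poisson errors, which provides a reference point for the rollback case.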

Another interesting parameter is the duration of state-save invocation,
$T_{ss}$. The interval has two mutually conflicting effects. Figure 7 points out that
increasing $T_{ss}$ will induce more rollback propagations and degrade
the coverage (a larger value of $N_I$ means a shorter state-save interval).
Since the occurrence of error is distributed throughout the state-save
interval, the average computation loss due to rollbacks is proportional to
the state-save duration. Therefore the increase of $T_{ss}$, which invokes
longer state-save intervals, will introduce more computation loss and a higher
probability of restart. On the other hand, the percentage of the total time
overhead for generating recovery blocks is reduced by the increase of $T_{ss}$.
The optimum value which minimizes the expected execution time can be
found in Figure 11. The figure shows that there exists a linear
relationship between $T_t$ and $T_{ss}$ when $T_{ss}$ is small (where the overhead of
generating recovery block dominates the final result). When $T_{ss}$ is greater
than the optimum value, the loss due to recovery increases considerably
because of the larger time loss in each rollback.
5. CONCLUSION


We considered the design of a hardware recovery mechanism for a
fault-tolerant multiprocessor with emphasis on a fast state-save operation
which requires little time overhead. To permit processes to be general and
to ensure programmer-transparency, recovery points are established
automatically and regularly. This approach does not require high-level
insertion strategies or limitations for setting up recovery points [6,7,8,26]
and also does not require synchronization of state-save operations among
different processors as does the COPRA system [27]. We derived
mathematically the probability of multi-step rollback, the coverage of
rollback recovery, and the risk of restart, which are usually ignored in
most existing analyses. The results in this work indicate that the
performance of the rollback recovery mechanism is significantly dependent
upon the risk of restart, which can be minimized by a higher local hit
ratio. So, the improvements are related to the partitioning, cooperation,
and allocation of processes.

Since the rollback mechanism used here only provides a recovery
capability to tolerate the hardware faults in processor modules and common
memory modules, further improvements should be considered to achieve the
overall system reliability. The reliability of the interconnection network can
be obtained by using redundant hardware to form additional paths (e.g.,
additional stages in generalized cube network [28]) or by using reliable
switches (e.g., 2X2 fault-tolerant switching element proposed in [29]).
However, faults occurring in the supplementary resources, like SSUs
and monitor switches, do not cause damage to the computation itself but
will change the recovery capability. Although the performability [30] of the
system at a single state is not affected by SSUs, etc., the overall lifetime
performability is changed because of the degradation of recovery capability.
A higher recovery capability can be gained by using hardware redundancy.
For instance, an additional standby monitor switch can either test the
active monitor switch or replace the active one whenever it malfunctions.

To deal with the performance of a fault recoverable and
reconfigurable multiprocessor, the delay on the task completion time due to
the errors is an important parameter. In such a system one or more faults
which cause errors in the computation and the loss of a portion of
function capability may have no serious consequence to the completion of a
given task. Moreover, the quality of the recovery procedure largely
determines the distribution of the task completion time. The traditional
reliability measures, such as reliability, availability, and computation
capacity, taken separately, thus cannot reflect the characteristics of this
fault-tolerant system. However, the overhead required to treat an error,
the contamination of error, and the effect on the task execution time
should be included to represent the effectiveness of fault-tolerance. In this
paper, we achieved the fast treatment of failure by the automatic rollback
recovery mechanism, and estimated the mean and variance of the completion
time for a given task under moderate assumptions. We also point out that
the assumption of no latency between error detection and error occurrence
can be relaxed if we know the confident rollback distance or the
distribution of this latency.


One major concern in most real-time applications, such as aircraft or
industrial control, etc., is whether the required task can be completed
prior to a given deadline or not. The rollback mechanism associated with
each module not only offers system modularity and simplicity, but provides
fast recovery and accurate prediction of the task completion time. Hence
the present fault-tolerant multiprocessor has a high potential use for critical
real-time applications.
Appendix A. Calculation of the probability of having k rollbacks within $T_{real}$


From the definition of $P_{st}(h)$, the task will roll back h steps with
probability $P_{st}(h)$ after a failure within the last phase $T_{real}$. Let the
rollback distance for the j-th rollback recovery be $t^j_{roll}$, which is
approximately equal to $hT_{ss}$ with a probability $P_{st}(h)$. Thus the
accumulated effective computation time before the k-th rollback, $TE_k$, is
given by

    $TE_k = \sum_{j=1}^{k} (x^j_b - t^j_{roll})^+$    (A.1)

Since the occurrence of rollback is a Poisson process with parameter
$\lambda_b$, the density function of $x^j_b$ is $\lambda_b \exp(-\lambda_b t)$. The probability that
$(x^j_b - t^j_{roll})^+ = 0$ is $P_{st}(h)(1 - \exp(-\lambda_b h T_{ss}))$. The density function of
$(x^j_b - t^j_{roll})^+$ is

    $f_b(t) = \sum_{h=1}^{N} P_{st}(h)(1 - \exp(-\lambda_b h T_{ss})) \delta(t) + \lambda_b \exp(-\lambda_b t) \sum_{h=1}^{N} P_{st}(h) \exp(-\lambda_b h T_{ss})$    (A.2)

where $\delta(t)$ is the impulse function. Let Z represent
$\sum_{h=1}^{N} P_{st}(h) \exp(-\lambda_b h T_{ss})$. Thus $f_b$ is simplified to

    $f_b(t) = (1 - Z)\delta(t) + \lambda_b \exp(-\lambda_b t) Z$    (A.3)

The characteristic function of $TE_k$, which is equal to $(\Phi_b(s))^k$ where
$\Phi_b(s)$ is the characteristic function of $(x^j_b - t^j_{roll})^+$, becomes

    $\Phi_{TE_k}(s) = ((1 - Z) + Z \lambda_b/(s + \lambda_b))^k$    (A.4)

Taking the inverse Laplace transform, the density function of $TE_k$
is obtained. Thus the distribution function of $TE_k$
becomes

    $P(TE_k \le t) = \sum_{i=0}^{k-1} \binom{k}{i} (1-Z)^i Z^{k-i} G_{k-i}(t) + (1-Z)^k$    (A.5)

where $G_{k-i}(t)$ is the (k-i)-th gamma distribution function.


Appendix B. Calculation of the characteristic function of total execution
time, $\Phi_t(s)$


From Fig. 8, the total execution time $T_t$ is the sum of $T_{real}$ and
$T_{rst}$, where $T_{rst} = \sum_{i=1}^{\ell} t^i_{rst}$ if there are $\ell$ restarts. With the conditional
probability of $T_{rst}$, we have the following equation:


It is assumed that the time interval between the (i-1)-th and the i-th
restarts is exponentially distributed with mean $1/\lambda_s$. Thus, for a given
$T_{real}$, $t^i_{rst}$ is distributed between
$t_{su}$ and $T_{real} + t_{su}$ with the density function $f_{rst|T_{real}}$, which is given in the
following:

    $f_{rst|T_{real}}(t + t_{su}) = \lambda_s \exp(-\lambda_s t) / (1 - \exp(-\lambda_s T_{real}))$  for $0 \le t \le T_{real}$    (B.2)

The probability of having $\ell$ restarts for a given $T_{real}$ is

    $P_{rs|T_{real}}(\ell) = (1 - \exp(-\lambda_s T_{real}))^{\ell} \exp(-\lambda_s T_{real})$    (B.3)

Since $T_t = T_{real} + \sum_{i=1}^{\ell} t^i_{rst}$ if there are $\ell$ restarts before the task completion,
the characteristic function of $T_t$ for a given $T_{real}$ becomes

    $\Phi_{t|T_{real}}(s) = \exp(-sT_{real}) \sum_{\ell=0}^{\infty} P_{rs|T_{real}}(\ell) (\Phi_{rst|T_{real}}(s))^{\ell}$    (B.4)

where $\Phi_{rst|T_{real}}(s)$ is the characteristic function of the time loss due to a
restart for a given $T_{real}$, i.e., the Laplace transformation of $f_{rst|T_{real}}(t)$. By
substituting $P_{rs|T_{real}}(\ell)$ and $\Phi_{rst|T_{real}}(s)$ into equation (B.4) and
integrating with the density function of $T_{real}$, the characteristic function of
$T_t$ is obtained as Equation (11).


    Steps h    Case 1     Case 2     Case 3
       1       0.75067    0.68610    0.44713
       2       0.93783    0.90147    0.69433
       3       0.98449    0.96907    0.83100
       4       0.99612    0.99029    0.90656
       5       0.99902    0.99695    0.94834

    Case 1: with minimum $f_{ij}$ and $N_I$ = 100
    Case 2: with minimum $f_{ij}$ and $N_I$ = 10
    Case 3: with maximum $f_{ij}$

Table 1. A Numerical Example for the Coverage
of Multi-step Rollbacks




    P  = processor         CM  = common memory
    S  = switch            AC  = access controller
    MS = monitor switch    SSU = state-save unit
    LM = local memory

Figure 1. The Organization of a Fault-Tolerant Multiprocessor using
a Rollback Recovery Mechanism

Figure 2. Sequence of a Rollback Recovery
    $RB_1(n)$ = 1 if n = 2, 0 otherwise
    $RB_2(n)$ = 1 if n = 2, 0 otherwise
    $RB_3(n)$ = 1 if n = 2, 0 otherwise
    $RB_4(n)$ = 1 if n = 1, 0 otherwise

    (c)

Figure 4. An Example of Rollback Propagation and Multi-step
Rollback
Figure 6. The Rollback Propagation Network

    Axes: MEAN ($T_t - T_{ef}$) (sec.) vs. TIME - FAILURE FREE (sec.)

Figure 9. Mean Time-Overhead vs. Error-Free Execution Time

Figure 10. Variance of Time-Overhead vs. Error-Free Execution Time




